CONTRIBUTIONS TO PARALLEL AND DISTRIBUTED COMPUTING IN KNOWLEDGE DISCOVERY AND DATA MINING By
Author
Abstract
Databases are growing continuously and without bound as a result of new data acquisition technologies. One challenge is how to extract knowledge from these large data sets. In this thesis, we analyze and improve the algorithmic solutions to four problems in knowledge discovery and data mining using parallel computing, and we compare our results with related work. We design two parallel algorithms for outlier detection. The first finds distance-based outliers using nested loops combined with randomization and a pruning rule. The second detects density-based local outliers. Both use data parallelism. The star coordinates plot is a useful visualization technique, but it has some drawbacks. We enhance the traditional star coordinates plot by introducing new parameters that allow data points to be visualized as polygons in two dimensions and as polyhedra in three dimensions. To visualize large data sets and reduce the computation time, we also design a parallel algorithm. We design a new meta-classifier algorithm and compare its performance with base classifier algorithms and bagging-based meta-classifier algorithms. Our meta-classifier gives better results than the other meta-classifier algorithms. To speed up its computation and make it suitable for large data sets, we develop a parallel algorithm. We develop a meta-clustering algorithm and compare its performance with two bagging-based meta-clustering algorithms and a hypergraph-partitioning meta-clustering algorithm. Our proposed meta-clustering algorithm gives results close to those of the best clustering algorithm and is more robust to the data dependency problem. A parallel algorithm to compute the four meta-clustering algorithms is also designed.
Our collection of sequential and parallel programs was tested on two different clusters of Linux-based workstations using real-world databases available in the Machine Learning Repository of the University of California at Irvine.
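The abstract names the ingredients of the distance-based outlier algorithm (nested loops, randomization, a pruning rule) without giving details. The following is a minimal sequential sketch of that general scheme, not the thesis's implementation. The outlier score (average distance to the k nearest neighbours), the function name `top_n_outliers`, and all parameters are assumptions made for illustration.

```python
import heapq
import math
import random


def top_n_outliers(points, k=2, n=2, seed=0):
    """Return the n highest-scoring points as (score, index), best first.

    Assumed score: average distance to the k nearest neighbours.
    Requires k < len(points).
    """
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)  # randomization: scan the data in random order

    cutoff = 0.0  # score of the weakest outlier currently kept
    top = []      # min-heap of (score, index) with at most n entries

    for i in order:
        nearest = []  # max-heap (negated) of the k smallest distances so far
        pruned = False
        for j in order:
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Pruning rule: the running score is an upper bound on the
            # final score, so a point that falls below the cutoff can
            # never enter the top n and the inner loop can stop early.
            if len(nearest) == k and len(top) == n:
                if -sum(nearest) / k < cutoff:
                    pruned = True
                    break
        if pruned:
            continue
        score = -sum(nearest) / len(nearest)
        heapq.heappush(top, (score, i))
        if len(top) > n:
            heapq.heappop(top)  # drop the weakest candidate
        if len(top) == n:
            cutoff = top[0][0]

    return sorted(top, reverse=True)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (-8, 7)]
    for score, idx in top_n_outliers(data, k=2, n=2):
        print(idx, data[idx], round(score, 2))
```

A data-parallel version could, for example, give each worker a block of candidate points and share the cutoff between workers; how the thesis partitions the work is not stated in the abstract.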
Similar resources
Grid-based Distributed Data Mining Systems, Algorithms and Services
Distribution of data and computation allows for solving larger problems and executing applications that are distributed in nature. The Grid is a distributed computing infrastructure that enables coordinated resource sharing within dynamic organizations consisting of individuals, institutions, and resources. The Grid extends the distributed and parallel computing paradigms allowing resource negoti...
Topic 05: Parallel and Distributed Databases, Data Mining and Knowledge Discovery
Managing and efficiently analysing the vast amounts of data produced by a huge variety of data sources is one of the big challenges in computer science. The development and implementation of algorithms and applications that can extract information diamonds from these ultra-large, and often distributed, databases is a key challenge for the design of future data management infrastructures. Today's ...
Design of Distributed Data Mining Applications on the KNOWLEDGE GRID
Many industrial, scientific, and commercial applications need to analyze large data sets maintained over geographically distributed sites. The geographic distribution and the large amount of data involved often oblige designers to use distributed and parallel systems. The Grid can play a significant role in providing an effective computational support for distributed data mining and knowledge d...
Knowledge Discovery on the Grid
In the last few decades, Grid technologies have emerged as an important area in parallel and distributed computing. The Grid can be seen as a computational and large-scale support, and even in some cases as a high-performance support. In recent years, the data mining community have been increasingly using Grid facilities to store, share, manage and mine large-scale data-driven applications. Ind...
Parallel Rule Mining with Dynamic Data Distribution under Heterogeneous Cluster Environment
Big data mining methods support knowledge discovery on highly scalable, high-volume and high-velocity data elements. The cloud computing environment provides computational and storage resources for the big data mining process. Hadoop is a widely used parallel and distributed computing platform for big data analysis and manages the homogeneous and heterogeneous computing models. The MapReduce fra...